STA 208: Homework 3 (Do not distribute)¶

Due 06/02/2023 midnight (11:59pm)¶

Instructions:

  1. Submit your homework using one file name ”LastName_FirstName_hw3.html” on canvas.
  2. The written portions can be either done in markdown and TeX in new cells or written by hand and scanned. Using TeX is strongly preferred. However, if you have scanned solutions for handwriting, you can submit a zip file. Please make sure your handwriting is clear and readable and your scanned files are displayed properly in your jupyter notebook.
  3. Your code should be readable; writing a piece of code should be compared to writing a page of a book. Adopt the one-statement-per-line rule. Consider splitting a lengthy statement into multiple lines to improve readability. (You will lose one point for each line that does not follow the one-statementper-line rule)
  4. To help understand and maintain code, you should always add comments to explain your code. (homework with no comments will receive 0 points). For a very long comment, please break it into multiple lines.
  5. In your Jupyter Notebook, put your answers in new cells after each exercise. You can make as many new cells as you like. Use code cells for code and Markdown cells for text.
  6. Please make sure to print out the necessary results to avoid losing points. We should not run your code to figure out your answers.
  7. However, also make sure we are able to open this notebook and run everything here by running the cells in sequence; in case that the TA wants to check the details.
  8. You will be graded on correctness of your math, code efficiency and succinctness, and conclusions and modelling decisions

Exercise 1 (Probabilistic PCA and EM)¶

In Week 5-2, we learned the Probabilistic PCA model proposed by Tipping and Bishop (1999).

Consider the model $$ x_i = \mu + \theta w_i + \sigma \varepsilon_i, $$ where $\theta \in \mathbb{R}^{p \times r}$ is the loadings matrix (i.e., $\theta = VD$), $w_i \overset{\text{i.i.d}}{\sim} N(0, I_r)$, and $\varepsilon_i \overset{\text{i.i.d}}{\sim} N(0, I_p)$. For now, let's assume $\mu = 0_p$ for simplicity.

The likelihood function can be written as $$ \mathcal{L}(\theta, \sigma^2 | x_1, \dots, x_n) = \prod_{i=1}^n \int N(x_i | \theta w_i, \sigma^2 I_p) N(w_i | 0, I_r) dw_i, $$ and it's log-likelihood function is given by $$ \ell(\theta, \sigma^2 | x_1, \dots, x_n) = \sum_{i=1}^n \log \int N(x_i | \theta w_i, \sigma^2 I_p) N(w_i | 0, I_r) dw_i. $$ To derive the MLE, we will use the EM algorithm:

  1. Derive the expectation-step (E-step) of the EM algorithm

  2. Obtain the expressions of $\hat \theta$ and $\hat \sigma^2$ from the maximization step (M-step)

  3. Describe how you implement the algorithm (e.g., you can write a sudo code). How can you make sure $\langle \theta_j, \theta_k \rangle = 0$ for any $j \neq k$ in each iteration? ($\theta_j$ is the $j$-th column of $\theta$)?

1.¶

$$ \begin{aligned} log p(x, w) &= \sum_{i = 1}^n [\frac{1}{2\sigma^2}||x_i- \theta w_i||^2 - \frac{p}{2}log(2\pi\sigma^2) - \frac{1}{2} ||w_i||^2 - \frac{r}{2}log2\pi] \\ &= \frac{1}{2\sigma^2} \sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta w_i+w_i^T \theta^T w_i) - \frac{np}{2}log(2 \pi \sigma^2)- \frac{1}{2}\sum_{i = 1}^n w_i^Tw_i - \frac{nr}{2} log(2\pi) \\ Q(\theta, \sigma^2) &= \frac{1}{2\sigma^2} \sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]) - \frac{np}{2}log(2 \pi \sigma^2)- \frac{1}{2}\sum_{i = 1}^n E[w_i^Tw_i] - \frac{nr}{2} log(2\pi) \end{aligned} $$

2.¶

For $\theta$: $$ \begin{aligned} \frac{\partial Q}{\partial \theta} &= -\frac{1}{2\sigma^2} \frac{\partial Q}{\partial \theta} \sum_{i = 1}^n - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i] \\ &= \frac{1}{2\sigma^2} \sum_{i = 1}^n \frac{\partial Q}{\partial \theta}(- 2 x_i^T \theta E[w_i]+E[ ||\theta w_i||^2]) \\ 0 &= \frac{1}{2\sigma^2} \sum_{i = 1}^n (- 2 x_i^T E[w_i]+ 2 E[w_i^T \theta w_i]) \\ \sum_{i = 1}^n x_i^T E[w_i] &= \theta \sum_{i = 1}^n E[w_i^T w_i] \\ \theta &= \frac{\sum_{i = 1}^n x_i^T E[w_i]} {\sum_{i = 1}^n E[w_i^T w_i]} \\ \end{aligned} $$

For $\sigma^2$: $$ \begin{aligned} \frac{\partial Q}{\partial \sigma^2} &= \frac{\partial Q}{\partial \sigma^2} \left[ \frac{1}{2\sigma^2} \sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]) - \frac{np}{2}log(2 \pi) - \frac{np}{2}log(\sigma^2)- \frac{1}{2}\sum_{i = 1}^n E[w_i^Tw_i] - \frac{nr}{2} log(2\pi) \right] \\ &= \frac{\partial Q}{\partial \sigma^2} \left[ \frac{1}{2\sigma^2} \sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]) - \frac{np}{2}log(\sigma^2) \right] \\ &= \frac{\partial Q}{\partial \sigma^2} \left[ \frac{1}{2\sigma^2} \sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]) - \frac{np}{2}log(\sigma^2) \right] \\ 0 &= \left[ \frac{1}{2(\sigma^2)^{2}} \sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]) - \frac{np}{2\sigma^2} \right] \\ 0 &= \left[ \frac{1}{(\sigma^2)} \sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]) - np \right] \\ \sigma^2 &= \frac{\sum_{i = 1}^n (x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i])}{np}\\ \end{aligned} $$

3.¶

To implement the algorithm, we could do multiple iterations with different initlizations of our starting parameters. Our algorithm would be split into two parts, the E and M. Both would have their own respetive for loops to implement the steps that are outlined above. After each step with the parameters, we would then compare the likelihood of our new updated parameters compared to our old ones. If the difference is smaller than some chosen tolerance $\epsilon$, than we could conclude that our algorithm has converged.

In order to ensure orthonality, we could create a constaint in the form of a Lagrange multiplier at the end of the expectation step.

Exercise 2 (Gaussian mixture model and MCMC)¶

We have discussed how to estimate the Gaussian mixture model using EM in class. In this question, we explore the Bayesian finite Gaussian mixture model using MCMC.

Recall the finite Gaussian mixture model with $K$ components: $$ p(\boldsymbol x | \boldsymbol \pi, \boldsymbol \mu, \boldsymbol \sigma^2) = \prod_{i=1}^n \sum_{k=1}^K \pi_k N(x_i | \mu_k, \sigma_k^2), $$ where $\boldsymbol x = (x_1, \dots, x_n)$, $\boldsymbol \pi = (\pi_1, \dots, \pi_K)$ is a vector of mixing proportions, $\boldsymbol \mu = (\mu_1, \dots, \mu_K)$ and $\boldsymbol \sigma^2 = (\sigma_1^2, \dots, \sigma_K^2)$ are means and variances of Gaussian densities (here, $\mu_k$ and $\sigma_k$ are scalars).

In Bayesian setting, we introduce a latent variable $\boldsymbol z = (z_1, \dots, z_n)$, $z_j = k$, $k = 1, \dots, K$. Then, the model can be written as

\begin{align} & z_i | \boldsymbol \pi \sim \text{Multinomial}(\pi_1, \dots, \pi_K),\\ & x_i | z_i = k \sim N(x_i | \mu_k, \sigma^2_k). \end{align}

We consider the following conjugate priors:

\begin{align} & \boldsymbol \pi \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_K),\\ & \mu_k|\sigma_k^2 \sim N(0, \rho_k \sigma_k^2),\\ & \sigma_k^{-2} \sim \text{Gamma}(a, b), \end{align}

where $\alpha_1, \dots, \alpha_K, \rho_k, a, b$ are some hyperparamters which are fixed. Their values can be chosen by the user. For Dirichlet distribution, please check its density function here.

We will develop an MCMC algorithm for the model:

  1. Derive the posterior distribution $p(z_i = k |\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \sigma^2, \boldsymbol x)$ for each $i = 1, \dots, n$, where $\boldsymbol z_{-i} = (z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n)$
  2. Derive the posterior distribution $p(\boldsymbol \pi | \boldsymbol x, \boldsymbol z)$ (hint: Dirichlet prior is conjugate with the multinomial distribution [see "Table of conjugate distributions"])(https://en.wikipedia.org/wiki/Conjugate_prior)
  3. Derive the posterior distribution $p(\sigma_k^{-2} | \boldsymbol x, \mu_k, \boldsymbol z)$ for each $k = 1, \dots, K$
  4. Derive the posterior distribution $p(\mu_k | \boldsymbol x, \boldsymbol z, \sigma_k^{2})$ for each $k = 1, \dots, K$
  5. Write a function, call it MCMCGaussianFiniteMixture, for the MCMC algorithm (choose $K = 3$).

The structure of your function should look like this:

  MCMCGaussianFiniteMixture(x, T = 5000, K = 3, put those hyperparameters here)

   # Note: T is the total iterations, K is the number of mixture components     
   {

        Step 1: choose initial values for parameters to start the algorithm

        Step 2: for t = 1, ..., T (total iteration) # start MCMC iterations

            Step 2.1 draw z_1, ..., z_n 

            Step 2.2 draw \pi_1, ..., \pi_K 

            Step 2.3: for k = 1, ... K:

                Step 2.3.1 draw \sigma_k^{-2} (or \sigma_k^2) 

                Step 2.3.2 draw \mu_k 

            End the for loop

        Collect values 

        End the for loop

        Output: sample draws of mu, sigma, pi
  }

To obtain draws from the Dirichlet distribution, use the function np.random.dirichlet

  1. Fit the mouse.csv data with the function MCMCGaussianFiniteMixture you wrote (choose $K = 3$, set the total iteration to be $5,000$ or larger. For hyperparameters, you can choose $\alpha_1, \dots, \alpha_K = 1, \rho_k = 1$, $a = 2, b = 2$; other values are possible as well),
    • a) plot the MCMC chains for $\pi_k$, $\mu_k$ and $\sigma_k$ for all $k = 1, \dots, K = 3$ (the title of each plot should include the corresponding variable name)
    • b) how many draws would you consider as burnin?
    • c) plot the ACF (autocorrelation) function for each chain after removing the burnin.
    • d) remove those draws you considered as burnin, and obtain the posterior means and 95\% credible intervals for each quantity. Compare those values with the MLE obtained using EM (i.e., fit the data using the GaussianMixture package). Comment on your findings (e.g., are those values from EM close to their posterior means? Do the 95\% credible intervals include those values from EM?)
    • e) (bonus point: 2 points) Using the MCMC draws you obtained to classify the mouse data (i.e., predict the mouse data and determine the label each data point belongs to). Plot the mouse data with the predicted labels using red color for Left Ear, green color for Right Ear, and blue for Head. Compare the plot with the one from EM (you can find the plot in the Week 5-2's lecture notes)

Note: BayesianGaussianMixture in sklearn here does not use the MCMC algorithm. It uses the so-called variational Bayes method, which has not been discussed yet.

Exercise 3 (Kernel density estimation)¶

The data set n90_pol.csv contains information on 90 British university students who participated in a psychological experiment designed to look for relationships between the size of different regions of the brain and political views.

The variables amygdala and acc indicate the volume of two particular brain regions known to be involved in emotions and decision-making, the amygdala and the anterior cingulate cortex; more exactly, these are residuals from the predicted volume, after adjusting for height, sex, and similar anatomical variables. The variable orientation gives the subjects’ locations on a five-point scale from 1 (very conservative) to 5 (very liberal). orientation is an ordinal but not a metric variable, so scores of 1 and 2 are not necessarily as far apart as scores of 2 and 3.

  1. Estimate the probability density for the volume of the amygdala. Plot it and report the bandwidth. Repeat this for the volume of the ACC.
  2. Estimate a joint probability density for the volumes of the amygdala and the ACC. What are the bandwidths? Are they the same as the bandwidths you got in problem 1? Should they be?
  3. Plot the joint density. Does it suggest the two volumes are statistically independent? Should they be? You may use three dimensions, color, contours, etc., for your plot, but you will be graded, in part, on how easy to read it is. (Hint: Remember that the random variables $X$ and $Y$ are statistically independent when their joint pdf is the product of their marginal pdfs, $p(x, y) = p(x)p(y)$. Think about what the product of your estimated pdfs from question 1 would look like.
In [ ]:
# 1
import pandas as pd
import numpy as np
from IPython.display import display
import scipy.stats as ss
import plotly.express as px 

import plotly.io as pio

# This ensures Plotly output works in multiple places:
# plotly_mimetype: VS Code notebook UI
# notebook: "Jupyter: Export to HTML" command in VS Code
# See https://plotly.com/python/renderers/#multiple-renderers
pio.renderers.default = "plotly_mimetype+notebook"

# Load the data
oi_brits = pd.read_csv("n90_pol.csv")

# Displays
display(oi_brits)
display(ss.gaussian_kde(oi_brits.amygdala).pdf(oi_brits.amygdala))

# Create the density and print the bandwidth
ss.gaussian_kde(oi_brits.amygdala).pdf(oi_brits.amygdala)
display(px.scatter(x = oi_brits.amygdala, y = ss.gaussian_kde(oi_brits.amygdala).pdf(oi_brits.amygdala), marginal_x = "histogram"))
print("The bandwidth is ", np.std(oi_brits.amygdala) * ss.gaussian_kde(oi_brits.amygdala).factor)
subject amygdala acc orientation
0 1 0.0051 -0.0286 2
1 2 -0.0674 0.0007 3
2 3 -0.0257 -0.0110 3
3 4 0.0504 -0.0167 2
4 5 0.0125 -0.0005 5
... ... ... ... ...
85 86 0.0174 -0.0242 2
86 87 0.0251 -0.0087 3
87 88 0.0676 0.0120 2
88 89 -0.0097 -0.0239 3
89 90 0.0374 0.0502 3

90 rows × 4 columns

array([12.19781554,  2.70213481,  7.68356675,  3.44448479, 11.03490713,
       12.60454171,  7.54193218, 12.57932449, 11.43798903,  2.67465351,
        3.81388624,  1.66957433, 12.5221318 , 11.74789421, 12.02985469,
        7.7528476 ,  5.70870721, 12.47813876, 12.47813876,  3.79260877,
       12.42605667, 11.90162535,  3.67591483,  8.07699268, 10.77466016,
       11.07070238, 11.96010531, 10.16116555,  3.77428528, 12.53381016,
        4.88587911,  4.04302659,  1.61323491,  9.58962624, 10.32554169,
        6.33905279, 12.38862504,  8.34598165,  8.71478225,  3.84088187,
       11.63228635,  9.18855761, 10.17941665,  7.20715352, 12.42605667,
        8.97565612, 11.06646711, 12.22701322,  6.94060474, 12.398562  ,
       11.97397281, 11.15978304,  7.09443158,  8.1140476 , 11.31763454,
        8.71478225, 12.24488676,  3.75648315, 12.39956902, 12.61569282,
        5.84134239, 12.32446606, 12.16102937,  7.82623488,  8.50625022,
       12.53978508,  3.40086792,  8.09690171, 12.59957911, 11.59241292,
        7.29008954, 12.40828129, 12.60101378,  8.52535839,  8.40411473,
       11.5721775 ,  3.90915262,  3.23732196,  8.03705373,  6.65078144,
       11.3399079 ,  3.7444858 ,  5.48892257,  8.15639009,  1.11126412,
       10.14291818,  8.73357019,  1.80721617, 11.85725938,  6.15145847])
The bandwidth is  0.013183004308375648
In [ ]:
display(ss.gaussian_kde(oi_brits.acc).pdf(oi_brits.amygdala))

# Creating the new density
ss.gaussian_kde(oi_brits.acc).pdf(oi_brits.acc)

# Displaying
display(px.scatter(x = oi_brits.acc, y = ss.gaussian_kde(oi_brits.acc).pdf(oi_brits.acc), marginal_x = "histogram"))
print("The bandwidth is ", np.std(oi_brits.acc) * ss.gaussian_kde(oi_brits.acc).factor)
array([1.55592973e+01, 9.73042096e-04, 1.07941930e+01, 2.87869209e+00,
       1.18306724e+01, 1.99726092e+01, 1.04468712e+01, 2.03210456e+01,
       1.28365274e+01, 8.90904886e-04, 7.85206738e-02, 1.88843695e-01,
       1.77286979e+01, 2.07381334e+01, 1.47985372e+01, 5.22115442e+00,
       5.63435870e+00, 2.08886723e+01, 2.08886723e+01, 6.77737741e-02,
       1.69156862e+01, 1.42970616e+01, 3.43284976e-02, 5.70636719e+00,
       1.85533730e+01, 1.19138632e+01, 2.10422533e+01, 9.95568766e+00,
       6.01285848e-02, 2.06532397e+01, 3.36745804e+00, 5.67171055e-01,
       1.38801349e-01, 1.54838379e+01, 1.02994084e+01, 7.36955838e+00,
       2.10956882e+01, 1.24031026e+01, 6.87204741e+00, 9.60594372e-02,
       2.05320044e+01, 1.44709591e+01, 9.99391830e+00, 9.61625855e+00,
       1.69156862e+01, 7.41189335e+00, 1.92830410e+01, 2.12097390e+01,
       4.32255272e+00, 2.10795618e+01, 1.45729334e+01, 1.95078801e+01,
       4.46109512e+00, 1.18410156e+01, 1.25179743e+01, 6.87204741e+00,
       1.58007878e+01, 3.04646008e+00, 1.67278219e+01, 1.93714835e+01,
       6.00695038e+00, 2.11703629e+01, 1.53802862e+01, 1.11422633e+01,
       6.46316271e+00, 1.79143509e+01, 1.03642868e-02, 5.73865350e+00,
       2.00640243e+01, 2.04557408e+01, 9.82349997e+00, 2.10623995e+01,
       1.88140544e+01, 6.49962537e+00, 1.25441292e+01, 1.32159729e+01,
       1.75198526e-01, 2.72708399e+00, 5.64246024e+00, 4.09578109e+00,
       1.99247634e+01, 5.00823442e-02, 3.52766681e+00, 5.83681667e+00,
       6.42947441e-03, 9.91743888e+00, 6.90998622e+00, 3.69901016e-01,
       2.09087611e+01, 3.79460104e+00])
The bandwidth is  0.008262433703070296

2.¶

In [ ]:
# Forming the joint density and finding the densities

jointKDE = ss.gaussian_kde([oi_brits.amygdala, oi_brits.acc]).pdf([oi_brits.amygdala, oi_brits.acc])
print("The bandwidth for the amygdala is", ss.gaussian_kde([oi_brits.amygdala, oi_brits.acc]).factor * np.std(oi_brits.amygdala))
print("The bandwidth for the acc is ", ss.gaussian_kde([oi_brits.amygdala, oi_brits.acc]).factor * np.std(oi_brits.acc))
The bandwidth for the amygdala is 0.015316368655217481
The bandwidth for the acc is  0.00959951750187294

Since the data, as we will explore in the next question further, is not independent, we expect the joint density bandwidths to differ from the marginal bandwidths.

3.¶

In [ ]:
# Plotting
display(px.scatter_3d(oi_brits, x = "amygdala", y = "acc", z = jointKDE), title="Joint Density")

We can see from the joint density that the joint density plot that the KDE is not symmetric. Rotating the plot, we can see the distrbution has some skew , thus showing us that these variables are not independent. This would explain also why the bandwidths of each variable from the joint density are different than that when we visualize the marginal distributions. If they were independent, we would expect marginal and seperate joint bandwidths to be identitical.